INDEX¶
- Importing Libraries
- Connecting Database
- Experiments Overview
- Data Preprocessing and Overview
- Evaluation Metrics and Overall Performance Analysis
- Parameter-wise Comparison
- Duration Calculations - Comparing Manual Approach to Active Learning
- Conclusion
- Additional Analysis
1. Importing Libraries¶
import pandas as pd
import pathlib
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
from scipy.stats import ttest_rel, mannwhitneyu
import itertools
from pandas.api.types import CategoricalDtype
from IPython.display import display
2. Connecting DB¶
# Checking the tables in the DB
db_path = "Data/final_experiments.db"
conn = sqlite3.connect(db_path)
# Example: list tables
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print("Tables:", cursor.fetchall())
Tables: [('sqlite_sequence',), ('prompts',), ('experiments',), ('targets',)]
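The `sqlite_master` catalogue query can be tried on a throwaway in-memory database (the table name here is illustrative, not the project schema):

```python
import sqlite3

# Throwaway in-memory database with a single table
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE experiments (id INTEGER PRIMARY KEY, approach TEXT)")

# Same catalogue query as used on the real database above
cur = conn_demo.execute("SELECT name FROM sqlite_master WHERE type='table';")
table_names = [row[0] for row in cur.fetchall()]
print(table_names)  # ['experiments']
conn_demo.close()
```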
# Loading tables: experiments
experiments_df = pd.read_sql("SELECT * FROM experiments", conn)
# Set ZeroShot
experiments_df.loc[
(experiments_df['N_Initial_Negatives'] == 0) & (experiments_df['Approach'] == 'LowShot'),
'Approach'
] = 'ZeroShot'
# Set FewShot
experiments_df.loc[
(experiments_df['N_Initial_Negatives'] == 1) & (experiments_df['Approach'] == 'LowShot'),
'Approach'
] = 'FewShot'
experiments_df
| | ID | Approach | Model | Dataset_Name | Features | Max_Train_Batch_Size | Max_Infer_Batch_Size | N_Initial_Positives | N_Initial_Negatives | Prompt_ID | Batch_Delay | Duration_Seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | FewShot | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 1 | 1 | 1.0 | 5 | 411.450860 |
| 1 | 2 | FewShot | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 1 | 1 | 1 | 2.0 | 5 | 4598.786834 |
| 2 | 3 | FewShot | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 1 | 1 | 3.0 | 5 | 531.276423 |
| 3 | 4 | FewShot | gemini-2.0-flash | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 1 | 1 | 1 | 4.0 | 5 | 1675.129835 |
| 4 | 5 | FewShot | gemini-2.0-flash | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 1 | 1 | 5.0 | 5 | 168.766710 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 217 | 218 | Active | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 1 | 0 | 194.0 | 1 | 13149.913656 |
| 218 | 219 | Active | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 6 | 0 | 195.0 | 1 | 9249.882802 |
| 219 | 220 | Active | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 2 | 0 | 196.0 | 1 | 9783.424044 |
| 220 | 221 | Active | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 2 | 0 | 197.0 | 1 | 9387.653716 |
| 221 | 222 | Active | hu3 | Nelson_2002_ids.csv | ['title', 'abstract'] | 50 | 10 | 5 | 0 | 198.0 | 1 | 13353.590528 |
222 rows × 12 columns
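The two `.loc` assignments above can be sanity-checked on a toy frame (column names mirror the real table; the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Approach": ["LowShot", "LowShot", "Active"],
    "N_Initial_Negatives": [0, 1, 0],
})

# LowShot with no initial negatives -> ZeroShot
toy.loc[(toy["N_Initial_Negatives"] == 0) & (toy["Approach"] == "LowShot"), "Approach"] = "ZeroShot"
# LowShot with exactly one initial negative -> FewShot
toy.loc[(toy["N_Initial_Negatives"] == 1) & (toy["Approach"] == "LowShot"), "Approach"] = "FewShot"

print(toy["Approach"].tolist())  # ['ZeroShot', 'FewShot', 'Active']
```

The order of the two assignments does not matter, since the two filters are disjoint.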
# Summary of Experiments
experiments_df['Approach_Model_Combined'] = experiments_df['Approach'] + ' - ' + experiments_df['Model']
approach_model_combinations = experiments_df['Approach_Model_Combined'].unique()
approaches_experiments = experiments_df['Approach'].unique()
models_experiments = experiments_df['Model'].unique()
datasets = experiments_df['Dataset_Name'].unique()
features = experiments_df['Features'].unique()
n_ini_pos = experiments_df['N_Initial_Positives'].unique()
print("Summary of Experiments:")
print("Things we have tested:")
print("\nApproaches:")
for ap in approaches_experiments:
print(f"- {ap}")
print("\nModels:")
for m in models_experiments:
print(f"- {m}")
print("\nDataset Names:")
for ds in datasets:
print(f"- {ds}")
print("\nFeature Names:")
for f in features:
print(f"- {f}")
print("\nN_Initial_Positives:")
for po in n_ini_pos:
print(f"- {po}")
Summary of Experiments:
Things we have tested:

Approaches:
- FewShot
- Active
- ZeroShot

Models:
- hu3
- gemini-2.0-flash
- hf-inference/meta-llama/Llama-3.3-70B-Instruct
- bayes
- logistic
- random_forest
- gemini-2.5-flash
- OpenWebUI/deepseek-r1:latest
- OpenWebUI/gpt-oss:20b

Dataset Names:
- Nelson_2002_ids.csv
- Cohen_2006_Antihistamines_ids.csv

Feature Names:
- ['title', 'abstract']
- ['title', 'abstract', 'keywords']

N_Initial_Positives:
- 1
- 2
- 5
- 6
3. Experiments Overview¶
In our experiments, we evaluate a combination of learning approaches and model types on biomedical literature screening tasks. The goal is to assess how different strategies perform in identifying relevant studies under limited supervision.
🔍 Approaches¶
We use the following learning paradigms:
- Zero-Shot: No training examples are provided; the model relies solely on pre-trained knowledge.
- Few-Shot: A small set of labeled examples is used to guide the model.
- Active Learning: The model iteratively selects the most informative samples for labeling.
🧠 Models Evaluated¶
✅ Traditional Machine Learning Models:¶
- random_forest
- bayes (Naive Bayes)
- logistic (Logistic Regression)
🤖 Large Language Models (LLMs):¶
- hu3 (provided by Humboldt University)
- gemini-2.0-flash
- gemini-2.5-flash
- hf-inference/meta-llama/Llama-3.3-70B-Instruct
- OpenWebUI/deepseek-r1:latest
- OpenWebUI/gpt-oss:20b
📂 Datasets Used¶
- Nelson_2002_ids.csv
- Cohen_2006_Antihistamines_ids.csv
Each dataset consists of scientific papers labeled for relevance in systematic reviews.
📄 Input Feature Variants¶
- ['title', 'abstract']
- ['title', 'abstract', 'keywords']
These configurations are used to test the models under different content settings.
# Loading tables: prompts
prompts_df = pd.read_sql("SELECT * FROM prompts", conn)
prompts_df.head()
| | ID | Augmentation | Augmentation_Item_Pattern | Prediction | Prediction_Item_Pattern | Positive_Token | Negative_Token | Prediction_Method |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | You are given a list of items, each with an "I... | {{"ID":"{record_id}", content: {title} {abstra... | {} | {{"ID":"{record_id}", content: {title} {abstra... | None | None | id |
| 1 | 2 | given the following text: {} | $$$ {title} {abstract} , STATUS={label_token} ... | Predict the STATUS, answer only with <POSITIVE... | $$${title} {abstract}, STATUS= | <POSITIVE> | <NEGATIVE> | token |
| 2 | 3 | [CONTEXT] {} | {{"ID":"{record_id}", content: {title} {abstra... | [USER QUESTION] which of the following are rel... | {{"ID":"{record_id}", content: {title} {abstra... | <RELEVANT> | <IRRELEVANT> | id_token |
| 3 | 4 | given the following text: {} | $$$ {title} {abstract} , STATUS={label_token} ... | Predict the STATUS, answer only with <POSITIVE... | $$${title} {abstract}, STATUS= | <POSITIVE> | <NEGATIVE> | token |
| 4 | 5 | [CONTEXT] {} | {{"ID":"{record_id}", content: {title} {abstra... | [USER QUESTION] which of the following are rel... | {{"ID":"{record_id}", content: {title} {abstra... | <RELEVANT> | <IRRELEVANT> | id_token |
# Display unique values from 'Prediction' column
print("Unique Predictions:")
print(prompts_df['Prediction'].unique())
# Display unique values from 'prediction_method' column
print("\nUnique Prediction Methods:")
print(prompts_df['Prediction_Method'].unique())
Unique Predictions:
['{} '
'Predict the STATUS, answer only with <POSITIVE> or <NEGATIVE>: {} '
'[USER QUESTION] which of the following are relevant, if any? {} '
'You are given a list of items, each with an "ID" and "content". \n Your task is to identify the **single item** that is most likely to be relevant but whose relevance is most uncertain. \n Return only the ID of this item for human review: {} '
'[USER QUESTION] which one from the following is the most relevant? {} '
' You are given a list of items, each with an "ID" and "content". \n Your task is to identify the **single item** that is most likely to be relevant but whose relevance is most uncertain. \n Return only the ID of this item for human review: {} ']
Unique Prediction Methods:
['id' 'token' 'id_token']
We test several different prompting methodologies: three prediction methods ('id', 'token', 'id_token'), each paired with its own prompt templates.
4. Data Preprocessing and Overview¶
# Loading tables: targets_df
targets_df = pd.read_sql("""
SELECT t.*, e.Dataset_Name
, case when e.N_Initial_Negatives = 0 and e.Approach = 'LowShot' then 'ZeroShot'
when e.N_Initial_Negatives = 1 and e.Approach = 'LowShot' then 'FewShot'
else e.Approach end as Approach
, e.Model
, e.Features
, e.N_Initial_Positives
, case when p.Prediction_Method is null then 'No Prompt' else p.Prediction_Method end as Prompting_Method
FROM targets t
left join experiments e
on t.experiment_id = e.id
left join prompts p
on e.Prompt_ID = p.id
""", conn)
targets_df
| | RECORD_ID | Experiment_ID | Label | Prediction | Dataset_Name | Approach | Model | Features | N_Initial_Positives | Prompting_Method |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | w101173079 | 1 | 0 | 0.0 | Nelson_2002_ids.csv | FewShot | hu3 | ['title', 'abstract'] | 1 | id |
| 1 | w105897893 | 1 | 0 | 1.0 | Nelson_2002_ids.csv | FewShot | hu3 | ['title', 'abstract'] | 1 | id |
| 2 | w1221903168 | 1 | 0 | 0.0 | Nelson_2002_ids.csv | FewShot | hu3 | ['title', 'abstract'] | 1 | id |
| 3 | w1361700 | 1 | 0 | 0.0 | Nelson_2002_ids.csv | FewShot | hu3 | ['title', 'abstract'] | 1 | id |
| 4 | w1528377975 | 1 | 0 | 0.0 | Nelson_2002_ids.csv | FewShot | hu3 | ['title', 'abstract'] | 1 | id |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50183 | w2022140822 | 222 | 0 | 1.0 | Nelson_2002_ids.csv | Active | hu3 | ['title', 'abstract'] | 5 | token |
| 50184 | w1361700 | 222 | 0 | 1.0 | Nelson_2002_ids.csv | Active | hu3 | ['title', 'abstract'] | 5 | token |
| 50185 | w2082571940 | 222 | 1 | 1.0 | Nelson_2002_ids.csv | Active | hu3 | ['title', 'abstract'] | 5 | token |
| 50186 | w1969424919 | 222 | 0 | 1.0 | Nelson_2002_ids.csv | Active | hu3 | ['title', 'abstract'] | 5 | token |
| 50187 | w2113264115 | 222 | 1 | 0.0 | Nelson_2002_ids.csv | Active | hu3 | ['title', 'abstract'] | 5 | token |
50188 rows × 10 columns
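The LEFT JOIN plus CASE relabelling in the query above can be sketched on an in-memory database (schemas abbreviated, rows invented; note that a target without a matching experiment keeps a NULL approach):

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE experiments (id INTEGER, Approach TEXT, N_Initial_Negatives INTEGER);
CREATE TABLE targets (Record_ID TEXT, Experiment_ID INTEGER);
INSERT INTO experiments VALUES (1, 'LowShot', 0), (2, 'LowShot', 1);
INSERT INTO targets VALUES ('w1', 1), ('w2', 2), ('w3', 99);
""")

demo_df = pd.read_sql("""
SELECT t.Record_ID,
       CASE WHEN e.N_Initial_Negatives = 0 AND e.Approach = 'LowShot' THEN 'ZeroShot'
            WHEN e.N_Initial_Negatives = 1 AND e.Approach = 'LowShot' THEN 'FewShot'
            ELSE e.Approach END AS Approach
FROM targets t
LEFT JOIN experiments e ON t.Experiment_ID = e.id
ORDER BY t.Record_ID
""", conn_demo)
print(demo_df["Approach"].tolist())  # ['ZeroShot', 'FewShot', None]
conn_demo.close()
```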
# Sanity check: confirm predictions are not all 1 (i.e., the model is not trivially labelling everything positive)
sum(targets_df.Prediction) == len(targets_df)
False
model_name_map = {
'hf-inference/meta-llama/Llama-3.3-70B-Instruct': 'Llama',
'OpenWebUI/deepseek-r1:latest' : 'deepseek-r1',
'OpenWebUI/gpt-oss:20b' : 'gpt-oss:20b'
# Add more mappings as needed
}
print(experiments_df['Model'].unique())
# Apply to `targets_df`
targets_df['Model'] = targets_df['Model'].replace(model_name_map)
experiments_df['Model'] = experiments_df['Model'].replace(model_name_map)
['hu3' 'gemini-2.0-flash' 'Llama' 'bayes' 'logistic' 'random_forest' 'gemini-2.5-flash' 'deepseek-r1' 'gpt-oss:20b']
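`Series.replace` with a dict leaves every unmapped value untouched, which is why only the long model identifiers are shortened. A minimal check:

```python
import pandas as pd

models = pd.Series(["hu3", "hf-inference/meta-llama/Llama-3.3-70B-Instruct", "bayes"])
short = models.replace({"hf-inference/meta-llama/Llama-3.3-70B-Instruct": "Llama"})
print(short.tolist())  # ['hu3', 'Llama', 'bayes']
```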
# Group by the full experiment context
group_cols = ["Dataset_Name", "Approach", "Model", "Prompting_Method", "Features", "N_Initial_Positives"]
results = []
for name, group in targets_df.groupby(group_cols):
# Drop rows with missing predictions or labels
group_clean = group.dropna(subset=["Prediction", "Label"])
# Skip if empty after dropping NaNs
if group_clean.empty:
continue
# Convert predictions and labels to integers if needed
y_true = group_clean["Label"].astype(int)
y_pred = group_clean["Prediction"].astype(int)
results.append({
"Dataset_Name": name[0],
"Approach": name[1],
"Model": name[2],
"Prompting_Method": name[3],
"Prompting_Features": name[4],
"N_Initial_Positives": name[5],
"Accuracy": accuracy_score(y_true, y_pred),
"Precision": precision_score(y_true, y_pred, zero_division=0),
"Recall": recall_score(y_true, y_pred, zero_division=0),
"F1_Score": f1_score(y_true, y_pred, zero_division=0),
"F2_Score": fbeta_score(y_true, y_pred, beta=2, zero_division=0),
"F0.5_Score": fbeta_score(y_true, y_pred, beta=0.5, zero_division=0),
"Support": len(group_clean)
})
metrics_df = pd.DataFrame(results)
# Sort by F1 Score (change to F2_Score or F0.5_Score if needed)
metrics_df = metrics_df.sort_values(by="F1_Score", ascending=False)
metrics_df
| | Dataset_Name | Approach | Model | Prompting_Method | Prompting_Features | N_Initial_Positives | Accuracy | Precision | Recall | F1_Score | F2_Score | F0.5_Score | Support |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 172 | Nelson_2002_ids.csv | ZeroShot | gemini-2.5-flash | id | ['title', 'abstract', 'keywords'] | 5 | 0.475983 | 0.272727 | 1.000000 | 0.428571 | 0.652174 | 0.319149 | 229 |
| 113 | Nelson_2002_ids.csv | FewShot | gemini-2.0-flash | id_token | ['title', 'abstract'] | 2 | 0.484848 | 0.276730 | 0.916667 | 0.425121 | 0.626781 | 0.321637 | 231 |
| 78 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 1 | 0.529032 | 0.283154 | 0.806122 | 0.419098 | 0.588674 | 0.325371 | 465 |
| 79 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 2 | 0.492086 | 0.271335 | 0.861111 | 0.412646 | 0.600194 | 0.314402 | 695 |
| 122 | Nelson_2002_ids.csv | FewShot | gemini-2.5-flash | id | ['title', 'abstract'] | 2 | 0.385281 | 0.252632 | 1.000000 | 0.403361 | 0.628272 | 0.297030 | 231 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | Nelson_2002_ids.csv | ZeroShot | deepseek-r1 | id | ['title', 'abstract', 'keywords'] | 1 | 0.789700 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 466 |
| 156 | Nelson_2002_ids.csv | ZeroShot | deepseek-r1 | id | ['title', 'abstract', 'keywords'] | 5 | 0.803493 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 458 |
| 155 | Nelson_2002_ids.csv | ZeroShot | deepseek-r1 | id | ['title', 'abstract', 'keywords'] | 2 | 0.793103 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 464 |
| 162 | Nelson_2002_ids.csv | ZeroShot | gemini-2.0-flash | id_token | ['title', 'abstract', 'keywords'] | 2 | 0.788793 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 232 |
| 176 | Nelson_2002_ids.csv | ZeroShot | gpt-oss:20b | id_token | ['title', 'abstract'] | 5 | 0.803493 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 229 |
192 rows × 13 columns
The models "deepseek-r1" and "gpt-oss:20b" failed to produce analysable results: their zero precision and recall across all configurations strongly suggest they predict only the majority class (class 0). We therefore discard them from the remainder of the analysis.
metrics_df = metrics_df[
(~metrics_df['Model'].str.contains("deepseek-r1", case=False)) &
(~metrics_df['Model'].str.contains("gpt-oss:20b", case=False)) &
(metrics_df['Precision'] > 0)
]
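The filter combines negated substring masks with a positive-precision condition; on invented rows it keeps only models that produced usable positive predictions:

```python
import pandas as pd

df = pd.DataFrame({
    "Model": ["hu3", "OpenWebUI/deepseek-r1:latest", "gpt-oss:20b"],
    "Precision": [0.3, 0.0, 0.0],
})
kept = df[
    (~df["Model"].str.contains("deepseek-r1", case=False))
    & (~df["Model"].str.contains("gpt-oss:20b", case=False))
    & (df["Precision"] > 0)
]
print(kept["Model"].tolist())  # ['hu3']
```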
5. Evaluation Metrics and Overall Performance Analysis¶
📊 Metrics and What They Mean for Literature Screening¶
In our task of identifying relevant academic papers, recall is the most important metric. Missing a relevant study can significantly affect the completeness and validity of the review process.
| Metric | Meaning in Literature Screening | Priority | Comment |
|---|---|---|---|
| Accuracy | % of correct predictions (both relevant and irrelevant) | 🔸 Low | Can be misleading in imbalanced datasets (e.g., 95% irrelevant papers). |
| Precision | Among predicted relevant papers, how many are truly relevant | ⚪ Medium | High precision reduces human workload by minimizing false positives. |
| Recall | Among all truly relevant papers, how many are found | ✅ High | Critical — missing a relevant study threatens review completeness. |
| F1 Score | Harmonic mean of precision and recall | ⚪ Medium | Balanced metric, but doesn’t emphasize recall strongly enough. |
| F2 Score | Like F1, but weighs recall more than precision | ✅✅ Very High | Best choice when recall is more important than precision — use this! |
| F0.5 Score | Like F1, but weighs precision more than recall | 🔸 Low | Not recommended — risks missing relevant papers. |
| Support | Number of examples in the group | ⚪ Neutral | Useful for judging reliability — large support gives stable metrics. |
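The precision/recall trade-off behind this table can be made concrete with a toy screening run (labels invented): all 4 relevant papers are found at the cost of 2 false positives, so F2 rewards the perfect recall while F0.5 penalizes the extra reviewer workload.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# 4 relevant papers out of 10; the screener flags 6 and catches all 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

p   = precision_score(y_true, y_pred)        # 4/6 ~ 0.667
r   = recall_score(y_true, y_pred)           # 4/4 = 1.0
f1  = f1_score(y_true, y_pred)               # 0.800
f2  = fbeta_score(y_true, y_pred, beta=2)    # ~0.909 (rewards recall)
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # ~0.714 (penalizes false positives)
print(round(f2, 3), round(f1, 3), round(f05, 3))  # 0.909 0.8 0.714
```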
5.1. Overall Performance Metrics¶
Next, we look at the overall performance of the models, averaging over all other parameter settings¶
For this overall comparison we restrict the data to the Nelson_2002_ids.csv dataset, since results differ substantially between the two datasets and mixing them would bias the averages.
metrics_df_nelson = metrics_df[
(metrics_df['Dataset_Name'] == 'Nelson_2002_ids.csv')
]
# Accuracy
plt.figure(figsize=(12, 6))
sns.barplot(data=metrics_df_nelson, x='Model', y='Accuracy', hue='Approach')
plt.title('Accuracy Score by Model and Approach')
plt.ylabel('Accuracy Score')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Recall
pivot = metrics_df_nelson.pivot_table(
index=['Model'],
columns='Approach',
values='Recall',
aggfunc='mean'
)
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title('Recall by Model and Approach')
plt.ylabel('Model')
plt.xlabel('Approach')
plt.show()
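`pivot_table` with `aggfunc='mean'` is what averages each (Model, Approach) cell before the heatmap is drawn; a small sketch with invented scores:

```python
import pandas as pd

scores = pd.DataFrame({
    "Model":    ["hu3", "hu3", "bayes", "bayes"],
    "Approach": ["Active", "Active", "FewShot", "FewShot"],
    "Recall":   [0.75, 0.25, 0.5, 0.5],
})
pivot_demo = scores.pivot_table(index="Model", columns="Approach", values="Recall", aggfunc="mean")
print(pivot_demo.loc["hu3", "Active"])  # 0.5, the mean of 0.75 and 0.25
```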
# Precision
pivot = metrics_df_nelson.pivot_table(
index=['Model'],
columns='Approach',
values='Precision',
aggfunc='mean'
)
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="Oranges")
plt.title('Precision by Model and Approach')
plt.ylabel('Model')
plt.xlabel('Approach')
plt.show()
F1 Score¶
pivot = metrics_df_nelson.pivot_table(
index=['Model'],
columns='Approach',
values='F1_Score',
aggfunc='mean'
)
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="Blues")
plt.title('F1 Score by Model and Approach')
plt.ylabel('Model')
plt.xlabel('Approach')
plt.show()
F2 Score¶
Formula:
$$ F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}} $$
Set β > 1 (typically β = 2) to weigh recall more.
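The formula can be checked numerically against sklearn's `fbeta_score` (toy labels, invented):

```python
from sklearn.metrics import precision_score, recall_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)   # 2/3
r = recall_score(y_true, y_pred)      # 2/3
beta = 2.0
manual = (1 + beta**2) * p * r / (beta**2 * p + r)
assert abs(manual - fbeta_score(y_true, y_pred, beta=2)) < 1e-9
```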
# F2 Score - weighs Recall heavier
pivot = metrics_df_nelson.pivot_table(
index=['Model'],
columns='Approach',
values='F2_Score',
aggfunc='mean'
)
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="Spectral")
plt.title('F2 Score by Model and Approach')
plt.ylabel('Model')
plt.xlabel('Approach')
plt.show()
F0.5 Score¶
Formula:
$$ F_{0.5} = (1 + 0.5^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{(0.5^2 \cdot \text{Precision}) + \text{Recall}} $$
The F0.5 score is a special case of the Fβ score where β = 0.5, which places more weight on precision than recall.
Use this when:
- Reducing false positives (i.e., irrelevant results) is more important than catching every relevant result.
- You want to optimize human review effort in high-volume screening settings like literature reviews or spam filtering.
# F0.5 Score - weighs Precision heavier
pivot = metrics_df_nelson.pivot_table(
index=['Model'],
columns='Approach',
values='F0.5_Score',
aggfunc='mean'
)
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="Spectral")
plt.title('F0.5 Score by Model and Approach')
plt.ylabel('Model')
plt.xlabel('Approach')
plt.show()
5.2. Evaluation Metric Selection and Alignment with Research Objective¶
Choice of Evaluation Metric: F0.5 Score¶
In this study, we adopt the F0.5 score as the primary evaluation metric. This choice directly supports our research objective:
to improve the efficiency of the literature screening process by reducing the manual workload without significantly compromising relevant coverage.
Why F0.5?¶
The F₀.₅ score is a weighted harmonic mean that places greater emphasis on precision than recall. This aligns with our setting, where:
- Literature screening typically starts with large volumes of candidate documents, of which only a small fraction are relevant.
- High precision ensures that fewer irrelevant papers are passed to reviewers, minimizing time and cognitive load.
- While this may slightly lower recall, the risk of missing a few relevant studies can be mitigated through later manual review or snowballing techniques.
- This balance is especially appropriate in non-critical or exploratory research domains.
In contrast, maximizing recall would be more critical in domains like clinical guideline development or safety-sensitive policy, where even a single missed study could have significant consequences.
Given the above rationale and its alignment with our research question,
F0.5 will be used as the default evaluation metric for the remainder of this analysis and all comparative performance reporting.
5.3. Overall Performance Commentary¶
The heatmap above presents F₀.₅ scores across different model–approach combinations. Recall that the F₀.₅ metric prioritizes precision over recall, aligning with our goal of reducing irrelevant literature during the screening process.
Key Observations:
- Highest performance (F₀.₅ = 0.28) is achieved by Gemini 2.5 Flash under the ZeroShot approach.
- HU3 and Gemini 2.0 Flash under Active Learning both follow closely with F₀.₅ = 0.27, showing that tailored prompts can match ZeroShot performance.
- Traditional models (logistic, random forest, bayes) perform consistently in the 0.24–0.25 range across all approaches, indicating limited sensitivity to prompting strategies.
- FewShot results vary more strongly across models. For example, Gemini 2.5 Flash under FewShot yields the lowest overall score (0.21), indicating that not all LLMs benefit from added context.
- Llama shows competitive ZeroShot performance (0.26), suggesting robustness without the need for example priming.
Interpretation:
- LLMs generally outperform classical baselines under most configurations.
- While ZeroShot is surprisingly effective for certain LLMs, Active Learning remains a strong contender, particularly when precision is essential.
- Prompt engineering or calibration may be necessary to unlock the full potential of FewShot prompting for specific LLMs.
These results reinforce the choice of F₀.₅ as our primary metric, as performance rankings would differ if recall were prioritized instead.
6. Parameter-wise Comparison¶
For the parameter-wise comparison we filter the dataset to:
- the "Nelson_2002_ids.csv" dataset
- LLMs only (excluding the traditional ML baselines and the two discarded models) to reduce noise and keep the comparison interpretable.
# Define list of traditional models and threshold models to exclude
exclude_models = [
"random_forest", "bayes", "logistic",
"deepseek-r1", "gpt-oss:20b"
]
# Filter the dataframe
metrics_df_pm_analysis = metrics_df[
(metrics_df["Dataset_Name"] == "Nelson_2002_ids.csv") &
(~metrics_df["Model"].isin(exclude_models))
].copy()
# Optional: check results
print("Filtered shape:", metrics_df_pm_analysis.shape)
display(metrics_df_pm_analysis.head())
Filtered shape: (92, 13)
| | Dataset_Name | Approach | Model | Prompting_Method | Prompting_Features | N_Initial_Positives | Accuracy | Precision | Recall | F1_Score | F2_Score | F0.5_Score | Support |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 172 | Nelson_2002_ids.csv | ZeroShot | gemini-2.5-flash | id | ['title', 'abstract', 'keywords'] | 5 | 0.475983 | 0.272727 | 1.000000 | 0.428571 | 0.652174 | 0.319149 | 229 |
| 113 | Nelson_2002_ids.csv | FewShot | gemini-2.0-flash | id_token | ['title', 'abstract'] | 2 | 0.484848 | 0.276730 | 0.916667 | 0.425121 | 0.626781 | 0.321637 | 231 |
| 78 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 1 | 0.529032 | 0.283154 | 0.806122 | 0.419098 | 0.588674 | 0.325371 | 465 |
| 79 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 2 | 0.492086 | 0.271335 | 0.861111 | 0.412646 | 0.600194 | 0.314402 | 695 |
| 122 | Nelson_2002_ids.csv | FewShot | gemini-2.5-flash | id | ['title', 'abstract'] | 2 | 0.385281 | 0.252632 | 1.000000 | 0.403361 | 0.628272 | 0.297030 | 231 |
6.1 Impact of Approach¶
# Group and sort by Approach
by_Approach = (
metrics_df_pm_analysis
.groupby("Approach", as_index=False)["F0.5_Score"]
.mean()
.sort_values("F0.5_Score", ascending=False)
)
# Display as a table
display(by_Approach)
plt.figure(figsize=(8, 4))
sns.barplot(
data=by_Approach,
x="F0.5_Score",
y="Approach",
hue="Approach", # Set hue to match y
palette="viridis",
dodge=False, # Don't split bars
legend=False # No need for legend, since y-axis already labels it
)
plt.title("Average F0.5 Score by Approach")
plt.xlabel("F0.5 Score")
plt.ylabel("Approach")
plt.tight_layout()
plt.show()
| | Approach | F0.5_Score |
|---|---|---|
| 0 | Active | 0.271308 |
| 2 | ZeroShot | 0.250431 |
| 1 | FewShot | 0.235961 |
Evaluation of Approaches¶
The bar plot above presents the average F₀.₅ score for each approach across the included models, filtered to the Nelson_2002_ids.csv dataset and excluding the traditional ML baselines and the discarded models (deepseek-r1, gpt-oss:20b).
- Active Learning outperforms the other approaches, achieving the highest average F₀.₅ score (0.271). This suggests that incorporating feedback-driven, iterative labeling leads to more precise classification while retaining sufficient recall.
- ZeroShot follows closely, indicating that LLMs can perform reasonably well even without specific examples, possibly benefiting from their pretraining on a wide range of content.
- FewShot, while still effective, yields the lowest precision-weighted performance, possibly due to suboptimal selection of demonstrations or the inability of models to generalize effectively from limited examples.
This ranking aligns with our primary research objective of minimizing reviewer workload during literature screening by favoring precision through the F₀.₅ score. Active learning appears to offer a valuable strategy by targeting the most uncertain samples, thus improving the balance between high-value inclusions and low false positives.
6.2 Impact of Model¶
# Group and sort by Model
by_Model = (
metrics_df_nelson
.groupby("Model", as_index=False)["F0.5_Score"]
.mean()
.sort_values("F0.5_Score", ascending=False)
)
# Display as a table
display(by_Model)
plt.figure(figsize=(8, 4))
sns.barplot(
data=by_Model,
x="F0.5_Score",
y="Model",
hue="Model",
palette="viridis",
dodge=False,
legend=False
)
plt.title("Average F0.5 Score by Model")
plt.xlabel("F0.5 Score")
plt.ylabel("Model")
plt.tight_layout()
plt.show()
| | Model | F0.5_Score |
|---|---|---|
| 2 | gemini-2.0-flash | 0.256126 |
| 6 | random_forest | 0.252053 |
| 4 | hu3 | 0.252029 |
| 5 | logistic | 0.247205 |
| 1 | bayes | 0.245015 |
| 3 | gemini-2.5-flash | 0.242495 |
| 0 | Llama | 0.225834 |
Evaluation of Models¶
The second plot compares the models averaged across all prompting approaches, on the Nelson dataset; the traditional baselines are included here, while the two discarded models remain excluded.
- Gemini-2.0-Flash achieves the top average F₀.₅ score (0.256), closely followed by HU3, the in-house LLM provided by Humboldt University (0.252). Both models demonstrate strong precision and moderate recall, aligning well with our F₀.₅-centric evaluation.
- Interestingly, Random Forest (0.252) outperforms several neural baselines, hinting that traditional ML models can still perform strongly when trained on well-curated labeled datasets.
- Logistic Regression and Bayes perform modestly, while Gemini-2.5-Flash and Llama fall behind. These may underperform due to hyperparameter settings, poor alignment to the task domain, or overgeneralization.
These results reinforce the notion that not all LLMs are created equal: task-specific fine-tuning and architecture design (as likely seen with HU3 and Gemini-2.0) play critical roles in downstream performance.
6.3 Impact of DataSet¶
by_Dataset = (
metrics_df
.groupby("Dataset_Name", as_index=False)["F0.5_Score"]
.mean()
.sort_values("F0.5_Score", ascending=False)
)
# Display as a table
display(by_Dataset)
g = sns.catplot(
data=metrics_df,
x='Model',
y='F0.5_Score',
hue='Approach',
col='Dataset_Name',
kind='bar',
height=5,
aspect=1.2
)
g.set_titles("{col_name}")
g.set_xticklabels(rotation=45)
g.set_axis_labels("Model", "F0.5 Score")
plt.tight_layout()
plt.show()
| | Dataset_Name | F0.5_Score |
|---|---|---|
| 1 | Nelson_2002_ids.csv | 0.246973 |
| 0 | Cohen_2006_Antihistamines_ids.csv | 0.072435 |
Dataset-Specific Performance Comparison¶
The figure above compares F0.5 scores across all models and prompting approaches for the two available datasets:
- Nelson_2002_ids.csv shows relatively strong and consistent performance across models and prompting methods. This dataset appears to be better balanced and more aligned with general LLM capabilities and the classifier learning process.
- In contrast, Cohen_2006_Antihistamines_ids.csv reveals significantly lower scores across the board. This can be explained by its high class imbalance, with very few positive labels compared to negatives. In such scenarios, achieving high precision is more difficult, especially without task-specific fine-tuning or tailored sampling strategies.
Observations:¶
- LLMs like HU3, Gemini-2.0-Flash, and LLaMA perform competitively on Nelson_2002, but their advantage disappears in the Cohen_2006 dataset.
- Simpler models (e.g., Bayes, Logistic Regression) struggle even more in the imbalanced case, reinforcing the challenge of learning with sparse positives.
- Notably, Active Learning generally maintains higher F₀.₅ across datasets, suggesting its robustness in refining relevant document identification when few labels are available.
Limitations and Scope¶
Several factors restrict the generalizability of these findings:
- Dataset Diversity: Both test datasets are sourced from the biomedical domain, limiting the transferability of conclusions to other fields like social sciences, policy, or engineering. Domain adaptation remains an open challenge for both ML and LLM-based screening models.
- Imbalanced Class Distributions: Particularly in Cohen_2006_Antihistamines, the high skew toward negatives makes evaluation sensitive to false positives, disproportionately penalizing precision-heavy metrics like F₀.₅.
- Resource Constraints: Due to the computational costs of running multiple models (especially LLMs), we were unable to evaluate on a broader range of datasets. Future work should include more domain-diverse corpora to validate the observed trends across settings.
Overall, this plot illustrates how model robustness varies significantly depending on dataset characteristics—especially balance and domain—reinforcing the need for tailored methods depending on screening objectives and resource budgets.
6.4 Impact of Prompting Strategy¶
For prompting method and features, the data are not comprehensive enough for a direct comparison, so further filtering is needed. Let's first check which combinations were actually tested.
# Check unique combinations to guide filtering
grouped = metrics_df.groupby(['Model', 'Approach', 'Prompting_Method', 'Prompting_Features'], as_index=False).size()
# Show all observed combinations
pivot = metrics_df.pivot_table(
index=['Model', 'Approach'],
columns='Prompting_Features',
values='F0.5_Score',
aggfunc='mean',
fill_value=0
)
pivot2 = metrics_df.pivot_table(
index=['Model', 'Approach'],
columns='Prompting_Method',
values='F0.5_Score',
aggfunc='mean',
fill_value=0
)
print("\nPrompting Features and Prompting Method availability by Model and Approach:")
display(pivot, pivot2)
Prompting Features and Prompting Method availability by Model and Approach:
| Model | Approach | ['title', 'abstract', 'keywords'] | ['title', 'abstract'] |
|---|---|---|---|
| Llama | FewShot | 0.000000 | 0.214679 |
| Llama | ZeroShot | 0.000000 | 0.250931 |
| bayes | Active | 0.073446 | 0.208140 |
| bayes | FewShot | 0.000000 | 0.249394 |
| gemini-2.0-flash | FewShot | 0.066393 | 0.235021 |
| gemini-2.0-flash | ZeroShot | 0.187341 | 0.238950 |
| gemini-2.5-flash | Active | 0.243259 | 0.000000 |
| gemini-2.5-flash | FewShot | 0.157762 | 0.169652 |
| gemini-2.5-flash | ZeroShot | 0.177420 | 0.181409 |
| hu3 | Active | 0.051862 | 0.179216 |
| hu3 | FewShot | 0.074150 | 0.207636 |
| hu3 | ZeroShot | 0.210806 | 0.230976 |
| logistic | Active | 0.073446 | 0.201643 |
| logistic | FewShot | 0.000000 | 0.250146 |
| random_forest | Active | 0.073446 | 0.201643 |
| random_forest | FewShot | 0.000000 | 0.259843 |
| Model | Approach | No Prompt | id | id_token | token |
|---|---|---|---|---|---|
| Llama | FewShot | 0.000000 | 0.234441 | 0.179063 | 0.230534 |
| Llama | ZeroShot | 0.000000 | 0.243772 | 0.000000 | 0.272408 |
| bayes | Active | 0.176003 | 0.234131 | 0.000000 | 0.000000 |
| bayes | FewShot | 0.249394 | 0.000000 | 0.000000 | 0.000000 |
| gemini-2.0-flash | FewShot | 0.000000 | 0.249063 | 0.202199 | 0.191967 |
| gemini-2.0-flash | ZeroShot | 0.000000 | 0.200777 | 0.205197 | 0.253655 |
| gemini-2.5-flash | Active | 0.000000 | 0.243259 | 0.000000 | 0.000000 |
| gemini-2.5-flash | FewShot | 0.000000 | 0.177389 | 0.144731 | 0.165207 |
| gemini-2.5-flash | ZeroShot | 0.000000 | 0.179415 | 0.000000 | 0.000000 |
| hu3 | Active | 0.000000 | 0.116236 | 0.167885 | 0.127431 |
| hu3 | FewShot | 0.000000 | 0.212303 | 0.172586 | 0.186491 |
| hu3 | ZeroShot | 0.000000 | 0.197044 | 0.292132 | 0.189087 |
| logistic | Active | 0.176003 | 0.000000 | 0.000000 | 0.000000 |
| logistic | FewShot | 0.250146 | 0.000000 | 0.000000 | 0.000000 |
| random_forest | Active | 0.176003 | 0.000000 | 0.000000 | 0.000000 |
| random_forest | FewShot | 0.259843 | 0.000000 | 0.000000 | 0.000000 |
The coverage tables show that "hu3" and "gemini-2.0-flash" have the most complete grid of prompting-strategy combinations, so we restrict the prompting analysis to these two models.
# Filter the dataframe to the two models with the broadest prompting coverage
metrics_df_pm_analysis_prompt = metrics_df_pm_analysis[
    metrics_df_pm_analysis["Model"].isin(['hu3', 'gemini-2.0-flash'])
].copy()
# Optional: check results
print("Filtered shape:", metrics_df_pm_analysis_prompt.shape)
display(metrics_df_pm_analysis_prompt.head())
Filtered shape: (55, 13)
| Dataset_Name | Approach | Model | Prompting_Method | Prompting_Features | N_Initial_Positives | Accuracy | Precision | Recall | F1_Score | F2_Score | F0.5_Score | Support | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113 | Nelson_2002_ids.csv | FewShot | gemini-2.0-flash | id_token | ['title', 'abstract'] | 2 | 0.484848 | 0.276730 | 0.916667 | 0.425121 | 0.626781 | 0.321637 | 231 |
| 78 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 1 | 0.529032 | 0.283154 | 0.806122 | 0.419098 | 0.588674 | 0.325371 | 465 |
| 79 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 2 | 0.492086 | 0.271335 | 0.861111 | 0.412646 | 0.600194 | 0.314402 | 695 |
| 80 | Nelson_2002_ids.csv | Active | hu3 | id_token | ['title', 'abstract'] | 5 | 0.644737 | 0.300000 | 0.600000 | 0.400000 | 0.500000 | 0.333333 | 228 |
| 141 | Nelson_2002_ids.csv | FewShot | hu3 | token | ['title', 'abstract'] | 1 | 0.586207 | 0.284404 | 0.632653 | 0.392405 | 0.508197 | 0.319588 | 232 |
6.4.1 Prompting Method - [id / token / id_token]¶
# Group by Prompting_Method and sort by mean F0.5
by_Promptingmethod = (
metrics_df_pm_analysis_prompt
.groupby("Prompting_Method", as_index=False)["F0.5_Score"]
.mean()
.sort_values("F0.5_Score", ascending=False)
)
# Display as a table
display(by_Promptingmethod)
plt.figure(figsize=(8, 4))
sns.barplot(
data=by_Promptingmethod,
x="F0.5_Score",
y="Prompting_Method",
hue="Prompting_Method", # Set hue to match y
palette="viridis",
dodge=False, # Don't split bars
legend=False # No need for legend, since y-axis already labels it
)
plt.title("Average F0.5 Score by Prompting Method")
plt.xlabel("F0.5 Score")
plt.ylabel("Prompting Method")
plt.tight_layout()
plt.show()
| Prompting_Method | F0.5_Score | |
|---|---|---|
| 1 | id_token | 0.271455 |
| 0 | id | 0.245207 |
| 2 | token | 0.244583 |
Evaluation of Prompting Method¶
The results indicate that combining both the document ID and content (id_token) in the prompt yields the highest F0.5 score. This suggests that the models benefit from a dual-reference structure—being told what item to label and being given its content increases clarity and likely improves alignment with task expectations.
- id_token: Performs best, possibly because the model can link metadata and content together, which gives additional context or anchoring for the prediction.
- id only or token only: Show slightly lower scores. Each seems to provide only part of the task signal—either the label target or the content—resulting in less precise outputs.
This aligns with previous findings in prompt engineering literature, where prompts with structured format and clear role cues tend to produce more reliable results.
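To make the three methods concrete, here is a minimal sketch of how such prompts might be assembled (the `make_prompt` helper and its wording are hypothetical, not the exact templates used in our experiments):

```python
def make_prompt(method, doc_id, text):
    """Build a screening prompt; the wording is illustrative only."""
    if method == "id":
        # Only the label target: the model knows WHICH item, not its content
        return f"Label document {doc_id} as relevant or irrelevant."
    if method == "token":
        # Only the content: the model sees WHAT to judge, without an anchor
        return f"Label the following text as relevant or irrelevant:\n{text}"
    if method == "id_token":
        # Dual reference: both the item identity and the content it should
        # base the decision on, which is the best-scoring variant above
        return f"Label document {doc_id} as relevant or irrelevant:\n{text}"
    raise ValueError(f"unknown method: {method}")

print(make_prompt("id_token", 42, "Title: ... Abstract: ..."))
```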
6.4.2 Prompting Features - ['title', 'abstract', 'keywords']¶
# Group by Prompting_Features and sort by mean F0.5
by_Promptingfeatures = (
metrics_df_pm_analysis_prompt
.groupby("Prompting_Features", as_index=False)["F0.5_Score"]
.mean()
.sort_values("F0.5_Score", ascending=False)
)
# Display as a table
display(by_Promptingfeatures)
plt.figure(figsize=(8, 4))
sns.barplot(
data=by_Promptingfeatures,
x="F0.5_Score",
y="Prompting_Features",
hue="Prompting_Features", # Set hue to match y
palette="viridis",
dodge=False, # Don't split bars
legend=False # No need for legend, since y-axis already labels it
)
plt.title("Average F0.5 Score by Prompting Features")
plt.xlabel("F0.5 Score")
plt.ylabel("Prompting Features")
plt.tight_layout()
plt.show()
| Prompting_Features | F0.5_Score | |
|---|---|---|
| 1 | ['title', 'abstract'] | 0.261410 |
| 0 | ['title', 'abstract', 'keywords'] | 0.222325 |
Evaluation of Prompting Features¶
Interestingly, prompting with only the title and abstract yields better results than also including keywords. This suggests that keywords, which are often noisy or inconsistent, introduce ambiguity or irrelevant signals; the model may treat them as unstructured, less informative input that dilutes its attention across less relevant features.
This result supports the hypothesis that concise but information-rich inputs (like titles and abstracts) are more effective than expanding the prompt with potentially noisy metadata.
Overall Evaluation¶
Best-performing combination: id_token prompting method with ['title', 'abstract'] features.
This setup likely gives the model:
- Clear task structure via ID
- Sufficient context via abstract/title
- No noise from over-extended inputs like keywords
Limitation¶
Prompt construction was predefined and not extensively tuned for each model, which could affect individual model responsiveness. However, these findings provide useful initial guidelines for designing effective prompts in literature screening settings.
6.5 Impact of Positive/Negative Ratio¶
Analyzing the relationship between initial class balance (N_Initial_Positives, N_Initial_Negatives) and F0.5 performance is very relevant, especially for understanding how LLMs react to imbalanced input prompts compared to classical models.
# Filter to LLMs and main dataset
llm_models = ['hu3', 'gemini-2.0-flash', 'gemini-2.5-flash', 'llama']
metrics_df_pn = metrics_df.copy()
metrics_df_pn = metrics_df_pn[
(metrics_df_pn['Model'].str.contains('|'.join(llm_models), case=False)) &
(metrics_df_pn['Dataset_Name'] == 'Nelson_2002_ids.csv')
]
# Filter for consistent prompting method
metrics_df_pn = metrics_df_pn[metrics_df_pn['Prompting_Method'] == 'id_token']
# Bin by N_Initial_Positives (since Negatives is missing)
metrics_df_pn['Pos_Bin'] = pd.cut(
metrics_df_pn['N_Initial_Positives'],
bins=[-1, 1, 3, 6, 10, float("inf")],
labels=["1", "2–3", "4–6", "7–10", "10+"]
)
# Group by bin + approach and calculate average F0.5
# (observed=False keeps empty bins as NaN rows and silences the pandas FutureWarning)
grouped = (
    metrics_df_pn.groupby(['Approach', 'Pos_Bin'], observed=False)['F0.5_Score']
    .mean()
    .reset_index()
)
# Show as table
display(grouped)
# Plot
plt.figure(figsize=(10, 5))
sns.barplot(
data=grouped,
x='Pos_Bin',
y='F0.5_Score',
hue='Approach',
palette='viridis'
)
plt.title("F₀.₅ Score vs. Number of Initial Positives (LLMs only)")
plt.xlabel("Initial Positive Examples")
plt.ylabel("Average F0.5 Score")
plt.legend(title="Approach")
plt.tight_layout()
plt.show()
| Approach | Pos_Bin | F0.5_Score | |
|---|---|---|---|
| 0 | Active | 1 | 0.325371 |
| 1 | Active | 2–3 | 0.314402 |
| 2 | Active | 4–6 | 0.307373 |
| 3 | Active | 7–10 | NaN |
| 4 | Active | 10+ | NaN |
| 5 | FewShot | 1 | 0.255700 |
| 6 | FewShot | 2–3 | 0.207205 |
| 7 | FewShot | 4–6 | 0.199890 |
| 8 | FewShot | 7–10 | NaN |
| 9 | FewShot | 10+ | NaN |
| 10 | ZeroShot | 1 | 0.317899 |
| 11 | ZeroShot | 2–3 | 0.323960 |
| 12 | ZeroShot | 4–6 | 0.156191 |
| 13 | ZeroShot | 7–10 | NaN |
| 14 | ZeroShot | 10+ | NaN |
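As a sanity check on the binning above: `pd.cut` uses right-inclusive intervals by default, so with these edges a value of 1 falls into the "1" bin and 2 into "2–3" (standalone sketch):

```python
import pandas as pd

# One example value per intended bin
positives = pd.Series([1, 2, 5, 8, 12])
bins = pd.cut(positives,
              bins=[-1, 1, 3, 6, 10, float("inf")],
              labels=["1", "2–3", "4–6", "7–10", "10+"])
print(list(bins))  # -> ['1', '2–3', '4–6', '7–10', '10+']
```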
Class Imbalance vs. LLM Performance (Initial Positives)¶
This analysis explores how different LLM-based approaches perform under varying levels of class imbalance — operationalized via the number of initial positive examples. In traditional supervised learning, class imbalance in the training data significantly affects model performance. Some classifiers perform best when trained on data that mirrors the real-world distribution, while others require more balanced training sets to generalize effectively.
Our results show:
- ZeroShot prompting performs well when only 1–3 positive examples are available but deteriorates sharply with more positive examples. This suggests that ZeroShot prompting may implicitly assume a strong imbalance, and adding more positive instances could confuse the LLM’s internal distributional expectations.
- FewShot, despite introducing a single negative example for balance, underperforms consistently. This may imply that minimal supervision (1 positive + 1 negative) is not enough to correct for distributional assumptions or guide the model effectively.
- Active learning, which is adaptive and feedback-based, maintains strong and stable performance across the populated bins. This suggests robustness to class imbalance and may reflect its ability to learn from feedback signals regardless of the initial label distribution.
Taken together, these results indicate that prompt-based LLMs do not behave like traditional classifiers when it comes to class imbalance. In particular, they may perform better under the exact imbalance conditions they were prompted with or assume, rather than with artificial balancing.
7. Duration¶
# Calculate Median Duration for each Approach - Model - Prompting Method
per_prompt_duration = experiments_df[
experiments_df['Dataset_Name'] == 'Nelson_2002_ids.csv'
].groupby(['Approach', 'Model', 'Prompt_ID'])['Duration_Seconds'].sum().reset_index()
# Step 2: Take median across prompts for each (Approach, Model)
duration_median = per_prompt_duration.groupby(['Approach', 'Model'])['Duration_Seconds'].median().reset_index()
# Optional: rename for clarity
duration_median.rename(columns={'Duration_Seconds': 'Median_Duration_Seconds'}, inplace=True)
duration_median['Total_Duration_Hours'] = (duration_median['Median_Duration_Seconds'] / 3600).round(2)
duration_median
| Approach | Model | Median_Duration_Seconds | Total_Duration_Hours | |
|---|---|---|---|---|
| 0 | Active | bayes | 30.763056 | 0.01 |
| 1 | Active | deepseek-r1 | 118.454709 | 0.03 |
| 2 | Active | gemini-2.5-flash | 2248.935067 | 0.62 |
| 3 | Active | hu3 | 7705.200369 | 2.14 |
| 4 | FewShot | Llama | 477.659234 | 0.13 |
| 5 | FewShot | deepseek-r1 | 628.933612 | 0.17 |
| 6 | FewShot | gemini-2.0-flash | 171.885253 | 0.05 |
| 7 | FewShot | gemini-2.5-flash | 391.085591 | 0.11 |
| 8 | FewShot | hu3 | 656.118692 | 0.18 |
| 9 | ZeroShot | Llama | 526.963726 | 0.15 |
| 10 | ZeroShot | deepseek-r1 | 115.278668 | 0.03 |
| 11 | ZeroShot | gemini-2.0-flash | 153.358271 | 0.04 |
| 12 | ZeroShot | gemini-2.5-flash | 296.620331 | 0.08 |
| 13 | ZeroShot | gpt-oss:20b | 11294.909760 | 3.14 |
| 14 | ZeroShot | hu3 | 643.754958 | 0.18 |
If we take Active hu3 as the default model, then for the 'Nelson_2002_ids.csv' dataset with 368 samples:
- Average manual literature screening would take 368 × 1.5 minutes (avg) ≈ 9.2 hours.
- With the LLM it takes 2.14 hours, up to ~80% more time-efficient.
- With ZeroShot hu3 it takes ~18 minutes with slightly worse precision, a ~98% time saving.
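The arithmetic behind these estimates can be reproduced directly from the duration table (the 1.5 min/abstract manual rate is the stated assumption):

```python
n_abstracts = 368
manual_min_per_abstract = 1.5            # assumed average manual screening speed
manual_hours = n_abstracts * manual_min_per_abstract / 60   # 9.2 hours

active_hu3_hours = 2.14                  # median duration from the table above
zeroshot_hu3_hours = 643.754958 / 3600   # ~18 minutes

print(f"Manual screening: {manual_hours:.1f} h")
print(f"Active hu3 time saving: {1 - active_hu3_hours / manual_hours:.0%}")
print(f"ZeroShot hu3 time saving: {1 - zeroshot_hu3_hours / manual_hours:.0%}")
```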
Screening Efficiency and Time Savings¶
Manual Screening Effort¶
Typical speed: ~1 minute per abstract
Estimated time for 10,000 abstracts: → ~160 hours total screening time (varies based on reviewer experience and document complexity)
LLM-Based Screening (Active Learning / Few-Shot Prompting)¶
- Literature and tools (e.g., ASReview, LLM-assisted systems) report: → 50–80% reduction in abstracts requiring human review
- In our evaluation: → 80–90% time savings observed across prompting methods
- Effective time required: → ~20–40% of original manual time (includes review of selected outputs)
Example: For 10,000 abstracts, LLM-assisted screening can reduce time to ~35 hours
Research Implications¶
- LLMs show strong potential to reduce workload in large-scale literature reviews
- Results typically undergo post-screening or reference checks, which are still manual
- Time efficiency can improve with better infrastructure and automation adoption
Limitations and Future Directions¶
- Recall performance is still limited: → Current LLMs prioritize precision, potentially missing relevant studies
- This trade-off is acceptable in exploratory or non-critical domains
- In high-stakes domains (e.g., healthcare), maximizing recall is essential → LLMs should support, not replace, human reviewers
Further work could explore:¶
- Prompt tuning and active learning strategies
- Hybrid human–LLM workflows
- Domain-specific calibration of LLMs
8. Conclusion¶
Model Performance & Metric Trade-Offs¶
Our evaluation across multiple models and prompting approaches for automated literature screening reveals several key patterns:
- Large Language Models (LLMs)—particularly HU3 and Gemini variants—show strong Accuracy and F₀.₅ scores, favoring precision. HU3, in particular, performs consistently well under Active Learning.
- Traditional classifiers (e.g., Random Forest, Naive Bayes) demonstrate high Recall but suffer from low Precision, leading to moderate or low F₀.₅ scores. This suggests over-inclusivity and high false positive rates.
- The contrast between LLMs and classical models highlights a trade-off:
- LLMs are more cautious, prioritizing correctness.
- Classical models prioritize coverage, retrieving more relevant studies at the cost of relevance precision.
- This divergence offers actionable insight: LLMs may be better suited for scenarios where reviewer time is constrained, while classical models could serve where completeness is critical.
Effect of Prompting Method & Features¶
- Prompting strategies (e.g., id, token, id_token) significantly impact LLM performance. Among these, id_token consistently yields better scores, likely due to its richer context structure.
- Similarly, prompt content (e.g., inclusion of keywords or structured metadata) affects precision-focused metrics—supporting the hypothesis that LLM prompting is sensitive to prompt engineering quality.
- Prompts asking for "most uncertain" or "most relevant" selections tend to provide more nuanced outputs but may introduce variability in model behavior.
Role of Screening Approach¶
- Active Learning generally improves F₀.₅ performance across LLMs and traditional models, especially for HU3 and RF.
- FewShot/LowShot prompting also performs competitively, particularly for Gemini and Llama models. This aligns with LLMs' strengths: generalizing from limited examples.
- ZeroShot approaches perform worst on average, underlining the importance of even minimal supervision for effective relevance screening.
Impact of Initial Class Imbalance¶
- We specifically analyzed how initial label imbalance (ratio of positive to negative examples) affects model performance:
- Binning experiments by the number of initial positive examples (1, 2–3, 4–6, ...), we observed:
- Some LLMs, like HU3, maintain stable performance across imbalances, particularly under FewShot.
- Other models, including classical classifiers, show stronger dependence on balanced datasets.
- Balanced training setups (1 positive, 1 negative) generally yield better F0.5 scores than positive-only ZeroShot setups—highlighting the value of minimal but balanced supervision.
- This mirrors classical ML behavior where model performance often depends heavily on how well the training distribution aligns with the inference scenario.
Screening Efficiency & Practical Implications¶
- Manual screening time for large-scale reviews (~10,000 abstracts) is estimated at 160+ hours.
- LLM-based pipelines reduce this drastically:
- Our findings and literature benchmarks suggest 60–80% reduction in time via active learning or FewShot prompting.
- Effective LLM-assisted screening time drops to 30–40 hours, especially when integrated with reviewer verification and reference checks.
- These results support deployment of LLMs as first-pass screeners, reserving human reviewers for edge cases and borderline decisions.
Research Implications¶
- Our findings suggest that LLMs are promising tools for enhancing review pipelines—especially when precision is prioritized.
- The sensitivity to class imbalance and prompting structures suggests LLM pipelines must be task-adaptive, not one-size-fits-all.
Hybrid strategies may provide the best of both worlds:¶
- LLMs for high-precision initial filtering.
- Classical models (or fallback human review) for high-recall post-sweep.
- Prompt engineering and small-scale manual supervision emerge as key levers for optimizing performance, especially in domain-specific or imbalanced datasets.
Limitations & Future Work¶
- Most benchmark datasets were from the biomedical domain, which limits generalizability across disciplines like social science or economics.
- Resource constraints limited the number of datasets and prompting variants explored.
Future directions include:¶
- Testing across diverse domains.
- Expanding prompt templates and strategies.
- Exploring active reviewer feedback loops and LLM fine-tuning for domain-specific screening.
9. Additional Analysis¶
9.1. Prompt and Model¶
pivot = metrics_df[
(metrics_df['Prompting_Method'] != 'No Prompt') &
(metrics_df['Approach'] == 'Active')
].pivot_table(
index='Model',
columns='Prompting_Method',
values='F1_Score',
aggfunc='mean'
)
plt.figure(figsize=(12, 6))
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="YlGnBu", linewidths=0.5, linecolor='gray')
plt.title('Mean F1 Score by Model and Prompting Method (Active Approach)')
plt.ylabel('Model')
plt.xlabel('Prompting Method')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
9.2. 3D Visuals for Overall Performance¶
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook' # or 'iframe' or 'plotly_mimetype'
metrics_df = metrics_df.copy() # Safe to modify now
metrics_df['pm_code'] = metrics_df['Prompting_Method'].astype('category').cat.codes
metrics_df['model_code'] = metrics_df['Model'].astype('category').cat.codes
metrics_df['approach_code'] = metrics_df['Approach'].astype('category').cat.codes
fig = px.scatter_3d(
metrics_df,
x='pm_code',
y='model_code',
z='approach_code',
color='F1_Score',
size='F1_Score',
hover_data=['Prompting_Method', 'Model', 'Approach', 'F1_Score'],
labels={
'pm_code': 'Prompting Method',
'model_code': 'Model',
'approach_code': 'Approach',
'F1_Score': 'F1 Score'
},
title='3D Visualization of F1 Scores by Prompting Method, Model, and Approach',
)
# Customize tick labels with category names
fig.update_layout(
scene=dict(
xaxis=dict(
tickvals=metrics_df['pm_code'].unique(),
ticktext=metrics_df['Prompting_Method'].unique(),
),
yaxis=dict(
tickvals=metrics_df['model_code'].unique(),
ticktext=metrics_df['Model'].unique(),
),
zaxis=dict(
tickvals=metrics_df['approach_code'].unique(),
ticktext=metrics_df['Approach'].unique(),
),
)
)
fig.show()
Commentary:¶
This commentary compares four standard evaluation metrics (Accuracy, Precision, Recall, and F1 Score) across the models used for academic literature screening.
Observations:
- hu3 (HU LLM3) shows the highest Accuracy (~0.6), which could indicate good overall performance, but its Recall and F1 Score are moderate.
- In general, switching between LowShot and Active approaches does not change a model's F1 score, except for the HU3 LLM.
- Traditional ML models like random_forest, logistic, and bayes exhibit extremely high Recall (>0.95), suggesting they correctly identify most relevant documents, but at the cost of Precision — possibly flagging many false positives.
- Gemini-2.0 and llama-3 show more balanced but moderate scores across all metrics.
Hypothesis: Traditional ML models may be tuned toward high sensitivity (Recall), potentially to avoid missing relevant literature, which is critical in academic screening. In contrast, LLMs may strike a better precision-recall tradeoff depending on their configuration.
9.3. Precision - Recall Curve¶
plt.figure(figsize=(10, 6))
sns.scatterplot(data=metrics_df, x='Recall', y='Precision', hue='Model', style='Approach', s=100)
plt.title('Precision vs. Recall')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
Precision-Recall Tradeoff Curves:¶
Our models show very different behavior in terms of recall vs. precision. A PR curve visualizes this trade-off per model and helps justify model selection depending on the screening objective (e.g. prioritizing fewer false negatives vs. fewer false positives).
from sklearn.metrics import precision_recall_curve, average_precision_score
plt.figure(figsize=(10, 6))
models_to_plot = targets_df['Model'].unique()
for model in models_to_plot:
    model_data = targets_df[targets_df['Model'] == model]
    # Get true labels and predicted labels
    y_true = model_data['Label']
    y_pred = model_data['Prediction']  # binary (0/1)
    # Using the hard 0/1 prediction as a score yields only a degenerate
    # two-point curve (not ideal, but okay here)
    precision, recall, _ = precision_recall_curve(y_true, y_pred)
    ap = average_precision_score(y_true, y_pred)
    plt.plot(recall, precision, marker='o', label=f"{model} (AP = {ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve per Model (binary predictions)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
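With only binary predictions, `precision_recall_curve` degenerates to a couple of points. Given continuous relevance scores (e.g. class probabilities, which a hard 0/1 `Prediction` column cannot provide), the full curve emerges; a self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                  # synthetic labels
# Synthetic scores: positives are shifted upward so the curve is informative
y_score = rng.normal(loc=y_true.astype(float), scale=1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print(f"curve points: {len(precision)}, AP = {ap:.2f}")
```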
9.4. Runtime vs. F1 Score Tradeoff¶
# Merge on Dataset + Model + Approach
merged_df = metrics_df.merge(
experiments_df[['Dataset_Name', 'Model', 'Approach', 'Duration_Seconds']],
on=['Dataset_Name', 'Model', 'Approach'],
how='inner'
)
plt.figure(figsize=(10, 6))
sns.scatterplot(
data=merged_df,
x='Duration_Seconds',
y='F1_Score',
hue='Model',
style='Approach',
s=100
)
plt.title("Runtime vs F1 Score Tradeoff")
plt.xlabel("Runtime (seconds)")
plt.ylabel("F1 Score")
plt.grid(True)
plt.tight_layout()
plt.show()
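Because the merge key is only (Dataset_Name, Model, Approach), each metrics row can match several experiment rows (one per prompt/configuration), multiplying rows in the result. A `'left'` merge with `indicator=True` is one way to check match coverage; a sketch on toy frames:

```python
import pandas as pd

# Toy stand-ins for metrics_df and experiments_df (values are hypothetical)
metrics = pd.DataFrame({"Model": ["hu3", "bayes"], "F1": [0.4, 0.3]})
experiments = pd.DataFrame({"Model": ["hu3", "hu3"],
                            "Duration_Seconds": [10.0, 12.0]})

merged = metrics.merge(experiments, on="Model", how="left", indicator=True)
print(merged)
# 'hu3' matches twice (row multiplication); 'bayes' shows _merge == 'left_only'
```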
How do datasets affect runtime or model success?¶
# Create 'Approach_Model' column by combining 'Approach' and 'Model' columns
experiments_df['Approach_Model'] = experiments_df['Approach'] + ' - ' + experiments_df['Model']
plt.figure(figsize=(14, 7))
sns.stripplot(
data=experiments_df,
x='Dataset_Name',
y='Duration_Seconds',
hue='Approach_Model',
dodge=True,
alpha=0.6,
jitter=True
)
plt.title("Runtime by Dataset and Approach/Model")
plt.xticks(rotation=45)
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.legend(title='Approach - Model', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.show()
9.5. Statistical Significance Testing of F1 Score Differences for Model/Approach Pairs¶
import numpy as np
from scipy.stats import ttest_rel, mannwhitneyu
import itertools
np.random.seed(42)
def paired_ttest(a, b):
    """Paired t-test on equal-size subsamples of two score series.

    Falls back to a Mann-Whitney U test if the t-test cannot be computed.
    """
    n = min(len(a), len(b))
    if n < 2:
        return None
    # Subsample both series to a common length so the pairing is well-defined
    aa = a.sample(n=n, random_state=42).reset_index(drop=True)
    bb = b.sample(n=n, random_state=42).reset_index(drop=True)
    try:
        stat = ttest_rel(aa, bb)
        return stat.statistic, stat.pvalue
    except Exception:
        s = mannwhitneyu(aa, bb, alternative='two-sided')
        return None, s.pvalue
# Use correct score column name
score_col = "F0.5_Score"
approaches = metrics_df["Approach"].dropna().unique().tolist()
if len(approaches) >= 2:
    pairs = list(itertools.combinations(sorted(approaches), 2))
    print('Paired tests for Approaches (F0.5):')
    for a1, a2 in pairs:
        s1 = metrics_df.loc[metrics_df["Approach"] == a1, score_col]
        s2 = metrics_df.loc[metrics_df["Approach"] == a2, score_col]
        res = paired_ttest(s1, s2)
        if res is None:
            print(f"  {a1} vs {a2}: not enough data")
        else:
            stat, p = res
            test_name = 'paired t-test' if stat is not None else 'Mann-Whitney'
            print(f"  {a1} vs {a2}: p-value = {p:.4f} ({test_name})")
else:
    print('Not enough distinct approaches for statistical comparison.')
Paired tests for Approaches (F0.5):
  Active vs FewShot: p-value = 0.1091 (paired t-test)
  Active vs ZeroShot: p-value = 0.0000 (paired t-test)
  FewShot vs ZeroShot: p-value = 0.3005 (paired t-test)
Commentary¶
To assess whether the performance differences between approaches are statistically significant, we conducted paired t-tests using the F0.5 score across all experiments. The F₀.₅ score emphasizes precision more than recall, in line with our study's objective to optimize literature screening efficiency.
Insights:¶
- The Active learning approach significantly outperforms ZeroShot, suggesting that models benefit from incremental feedback even in low-resource scenarios.
- However, the performance difference between Active vs FewShot and FewShot vs ZeroShot is not statistically significant, indicating that the gain from using 1 negative sample (FewShot) over zero (ZeroShot) may be marginal in general performance.
- This supports our earlier observations that ZeroShot performs consistently worse, while Active learning yields the best results overall.
These results reinforce the utility of Active learning when resources allow iterative labeling, and highlight that adding minimal supervision (FewShot) may not always lead to significant improvements unless followed by active updates.
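One caveat: three pairwise tests on overlapping data inflate the chance of a spurious significant result, so a multiple-comparison correction (not applied in the analysis above) would tighten the per-test threshold. A standalone sketch on synthetic scores:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
a = rng.normal(0.25, 0.05, size=30)       # synthetic F0.5 scores, approach A
b = a + rng.normal(0.03, 0.05, size=30)   # approach B, slightly higher on average

stat, p = ttest_rel(a, b)
alpha = 0.05
n_comparisons = 3                         # the three Approach pairs tested above
# Bonferroni: each test must clear alpha / n_comparisons instead of alpha
print(f"p = {p:.4f}, Bonferroni threshold = {alpha / n_comparisons:.4f}")
```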